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Abstract The Expectation-Maximization (EM) algorithm is a commonly used method for finding 
the maximum likelihood estimates of the parameters in a mixture model via coordinate ascent. A 
serious pitfall with the algorithm is that in the case of multimodal likelihood functions, it can get 
trapped at a local maximum. This problem often occurs when sub-optimal starting values are used to 
initialize the algorithm. Bayesian initialization averaging (BIA) is proposed as an ensemble method 
to generate high quality starting values for the EM algorithm. Competing sets of trial starting values 
are combined as a weighted average, which is then used as the starting position for a full EM run. 
The method can also be extended to variational Bayes (VB) methods, a class of algorithm similar to 
EM that is based on an approximation of the model posterior. The BIA method is demonstrated on 
real continuous, categorical and network data sets, and the convergent log-likelihoods and associated 
clustering solutions presented. These compare favorably with the output produced using competing 
initialization methods such as random starts, hierarchical clustering and deterministic annealing, 
with the highest available maximum likelihood estimates obtained in a higher percentage of cases, 
at reasonable computational cost. The implications of the different clustering solutions obtained by 
local maxima are also discussed. 

Keywords Bayesian model averaging • Expectation-maximization algorithm • Einite mixture 
models • Hierarchical clustering • Model-based clustering ■ Multimodal likelihood. 


1 Introduction 


The Expectation-Maximization (EM) algorithm is a commonly used coordinate as cent method for 
finding the maximum likelihood estimates for the parameters in a mixture model (jPemDster et al , 
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19771) . To initialize the algorithm, starting values are chosen for the allocations of observations to 


the available number of clusters and these lead to the initial calculation of the model parameters. 
Often this starting allocation is done on a random basis. However, once these parameter values have 
been specihed, the algorithm proceeds in an entirely deterministic fashion. 

An advantage of the EM algorithm is that it does not produce decreases in the likelihood 
function. A drawback of the algorithm is that it can converge to or get trapped for many iterations 
at a local maximu m, leading to failure to reach the global maximum and resulting in an inferior 
clustering solution. Zhou and Langel ( 201(J) describe this multimodality of the likelihood function 
as "one of the curses of statistics". Various convergence criteria can be used to stop the algorithm. 
However these criteria tend to characterize a lack of progress of the algorithm, rather than the fact 
that the global maximum has been reached. 

Many adaptations to the EM algorithm have been proposed that address this issue by al¬ 
tering its form. These includ e, among others, an EM algorithm featuring maximization using 


the Newton-Raphson method ( Redner^iid^^IkeiJ, 1198411 ; the Classification EM (CEM ) algorithm 


(iBiernacki et al , 2003t Celeux and Govaert 1992h: the Moment Matching EM algorithm ( Karlis and Xekalaki , 


2003 ); the Annealing EM algori thm I Zhou and Lang j . 2010l) : a hybrid EM/Gauss-Newton algo- 


rithm ( Aitkin a nd Aitkin . 19961) : the Expectati on Conditional Maxim ization (ECM) algorithm 
( Meng an d Rubin, 1992) - the emEM algorithm ( Biernacki et 2003 ); the Multicyc le EM algo¬ 
rithm ( Meng and Rubin , 19931) and the Sparse EM algorithm ( Neal and Hintonl . Il999^ . 

Alternatively this problem can be addressed by identifying high quality, as opposed to random, 
starting values for the algorithm in a n attempt to optimize the eventual clustering solution. Recent 
dev elopments of this app r oach i nclude O’Hagan et 11 (l2m2h , who develop a pyramid burn-in scheme, 
and Baudrv and Celeua ( 20151) . who propose a recursive initialisation strategy based on splitting 
a selected component from a previous clustering solution, in particular with a view to performing 
robust model selection. 

Bayesian initialisation averaging (BIA) is proposed as a new and complementary method in this 
spirit. To generate starting values using BIA, a trial number of E-steps and M-steps are run on 
each of a sequence of random starting positions to give an updated sequence of potential starting 
positions. Using the likelihood associated with each updated starting position, a corresponding 
weight is calculated and a new single overall weighted average starting position is formed. The EM 
algorithm is then run to convergence from this juncture. 

For some types of model, the log-likelihood is intractable, and inference using EM is not pos¬ 
sible. We demonstrate our approach, in a Bayesian setting, for one such model using variational 
Bayes (VB) methods, an EM-type algorithm based on an approximation of the m odel posterior . 
Related algorithms for this type of scenario include itera ted conditional models (Bes^ 19861) . 
the Broyden-Fletcher-Goldfarb-S hanno (BEGS) algorithm (Bvrd et al . 1995 : Zhu et al . 1997h and 
sequential Monte Garlo for VB ( McGrorv and Ahfockl . 2014h . The latter shares some features in 
common with annealing and pyramid burn-in. 

As an illustrative example of BIA, consider Figure [U which is based on a clustering model using 
a VB algorithm. (See Sections 12.31 and 13.2.11 for specific details). The dashed lines show the value 
of the (lower bound to the) log-posterior at each iteration using two sets of randomly generated 
suboptimal starting positions. Using these starting positions, the algorithm converges to distinct 
solutions which are similarly valued in terms of log-posterior. Using BIA, it is possible to obtain 
a new, superior set of starting values, based on a weighted combination of the value of the two 
positions after running the algorithm for just 20 iterations. Hence a superior clustering solution in 
terms of log-posterior (the solid line) may be obtained. 
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Fig. 1 Illustrative example of the Bayesian initialisation averaging approach applied to a clustering model for 
network data using an EM-like algorithm. The dashed lines show the value of the lower bound to the log-posterior 
at each iteration using suboptimal starting positions; by using BIA, a superior clustering solution (the solid line) is 
obtained. 


BIA starting values are used as an initialization method to cluster a variety of continuous, 
categorical and network data examples using mixtures of Gaussians, latent class analysis (LCA) 
and the stochastic blockmodel (SBM), implemented via the EM (mixture of Gaussians and EGA) 
and VB (SBM). Even for the more complicated SBM approach where the likelihood can only be 
approximated, the proposed BIA method fares well. We demonstrate the approach on real and 
simulated data sets in comparison to competing methodologies such as random starts, hierarchical 
clustering, pyramid burn-in schemes and deterministic annealing. 

Section [2] presents the data sets used as motivating examples to illustrate the benefits of the 
proposed method. Section [3] gives a overview of how the EM algorithm is used for fitting mixture 
distributions. The calculations for the E-step and M-step are provided for different types of appli- 









4 


Adrian O’Hagan, Arthur White 


cation. A description of how to apply BIA to generate EM algorithm starting values is detailed. A 
label switching problem that is encountered, as well as a means of overcoming it, are documented. In 
SectionUthe outcomes of the methods tested on the data sets referenced in Section[5]are described. 
Convergent log-likelihood plots and clustering solutions are presented for BIA and its competing 
methodologies in each case. Section [S] details the conclusions reached and summarizes the main 
findings. 


2 Illustrative Data Sets 


Three types of data set are used as motivating examples in testing the BIA method. The first type 
i s mul tivariate continuous data (the AIS data set and a simulated data set taken from iBaudrv et al 
( 2 OI 5 II I; the second type is multivariate categorical data (the Carcinoma and Alzheimer’s data 
sets); the third type is network data (the Karate data set). These are described below. 


2.1 Australian Institute of Sport (AIS) data and simulated data 


The Australian Institute of Sport data set contains biometric observations o n n = 202 Australian 
athle tes (102 males and 100 females) at the Australian Institute of Sport (jCook and Weisberd . 
199411 . The full data set contains 13 variables (11 continuous and 2 discrete) and all continuous 
variables are included in the analysis: red cell count (rcc), white cell count (wcc), hematocrit (He), 
hemoglobin (Hg), plasma ferritin (Fe), body mass index (bnii), sum o f skin folds (ssf), body fat 
percentage (Bfat), lean body mass (Ibm), height (Ht) and weight (Wt) ( Cook and Weisberd . Il993l 
(see Figure [2]) • 

A further simulated data set taken from iBaudrv et al ( 2015ll experiment 3 is tested, namely 200 
observations simulated from a mixture of three multivariate Gaussian components with diagonal 
covariance structure. The second and third horizontal components (plotted as red crosses) overlap, 
giving the appearance of forming a single cluster, as seen in Figure [H The first component, plotted 
as black circles, is vertical in nature. For this data set, both a 2 group and a 3 group solution are 
relevant. 100 such data sets are randomly simulated for the purposes of the study. 


2.2 Carcinoma and Alzheimer’s data 


Both the Carcinoma and Alzheimer’s data sets consist of multivariate binary data, to which LCA 
is applied. The Carcinoma data set contains dichotomous ratings by m = 7 pathologists of n = 118 
slides for the presenc e or absence of carcinom a in the uterine cervix. The dat a set is availab le in the 
R package poLCA ( Linzer and Lewisl . I 2 OIIII and is described at length bv lAgre^ 1 20021 Section 
13.2). As per the simple percentage diagnosed summary for each doctor in TablelU the data clearly 
demonstrate that some pathologists are more prone to diagnosing the presence of carcinoma than 
others. Usin g LCA we aim to ident ify sets of observations for which they agree and disagree. The 
approach of Zhou and Langel ( 2010ll is followed and the case of G = 4 classes considered. 

The Alzheimer’s data set documents the presence or absence of m = 6 symptoms in n = 240 
patients diagnosed with early onset Alzheimer’s disease, conduc ted in Saint James’ Hospita l, Dublin, 
Ireland. This data is a vailable in the R p acka ge BayesLC A ( White and Murphvl . I20l3l . and was 
previously analyzed bv iMoran et all ( 2004h and Walsh ( 20061) . where the goal was to identify whether 
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Fig. 2 Pairs plot of the Australian Institute of Sport (AIS) data. 


Table 1 Percentage of slides identified as having cancer present by the seven different doctors in the Carcinoma 
study. 


Pathologist 

A 

BCD 

E F G 

Detected (%) 

0.56 

0.67 0.38 0.27 

0.60 0.21 0.56 


groups of patients with the condition presented distinctive patterns of symptoms. The case where 
G = 3 latent classes are assumed to be present in the data is investigated. 


2.3 Karate data 


Zachary’s Karate club data set records the social network of fr iendships betwe en n = 34 members 
of a karate club at a United States university in the 1970s (|ZacharvL 1197711 . While friendships 
within the network initially arose organically, a dispute between a part time instructor and the club 
president led to two political factions developing, with the members taking the side of either the 




































































































































































































































































































































































































































































6 


Adrian O’Hagan, Arthur White 



Fig. 3 Pairs plot of the simulated mixture of Gaussians data. 


president or the instructor, ultimately leading to the formation of a new organisation. A G = 4 
group SB M is applied to this data. The net work is visualised in Figure 0] using the Fruchterman- 
Reingold dFrucht erman and Reingoldl Il99lh layout algorithm available in the R package igraph 


( Csardi and Neousz . 2006fl . 


3 Methods 

3.1 The EM Algorithm 

The EM algorithm is a powerful computational technique for finding the maximum likelihood es¬ 
timates (MLEs) of the parameters in a mixture model when there are no closed-form maximum 
likelihood estimates, or when the data are in c omple te ( McLachlan and Krishnail 1997 1. The EM 
algorithm was introduced bv iDempster et'3 (1977) and has many different applications, includ¬ 
ing cluster analysis, censored data modelling, mixed models and factor analysis. In model-based 
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Fig. 4 Plot of the Karate network data set. 


clustering, the EM algorithm maximizes the expected complete data log-likelihood to produce up¬ 
dated parameter values. Ultimately this process maximizes the observed data log-likelihood. The 
algorithm involves two steps, the E or expectation step and the M or maximization step. Initial 
starting values are required for the model parameters and the E and M steps are then iterated 
until convergence. In terms of initializing the algorithm the aim is to generate good-quality starting 
values that will lead to the algorithm converging to the highest possible log-likelihood. 


3.2 Fitting mixture models using the EM Algorithm 

Let X be the matrix of all n observations and the m—dimensional vector = xn ,..., Xim denote 
the value of the observation, for i = l,2,...,n. Each observation belongs to one of G groups. 
Zig = 1 when observation i belongs to group g and Zig = 0 otherwise. Group labels are not known 
in advance. Z is an (n x G) classification matrix of Zig values for the matrix X. Tg is the probability 
that Xi belongs to group g. Bg denotes the distributional parameters for group g. The composition of 
the parameter set 6 varies depending on the application at hand; for example, in a continuous data 
setting where a mixture of Gaussians is fitted, 6 could represent the group means and covariance 
matrices. Further examples are given in Section 13.2.11 
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If each observation belongs to one of G groups with probability Tg then the density of x^, 
conditioning on the distributional parameters 6 and group membership probabilities t, is given by: 

G 

Y.[rgP{^i\eg)] ( 1 ) 

The observed likelihood is the product of the densities of the x^, conditioning on 6 and r: 

n G 

L = p{x\T,e) = l[J2[TgP{^,\eg)]. (2) 

i=l 3=1 

This function is difficult to work with as its natural logarithm cannot be differentiated in a straight¬ 
forward manner. However, introducing the Z classification matrix, the complete data likelihood Lc 
can be expressed as: 

n G 

L, = P(X, Z|r, 0) = n n [^9 ^ (Xi| 03 )]'^“ • (3) 

j=l 9=1 

The complete data log-likelihood Ic can then be written as: 

n G 

Ic = '^'^Z^g [log Tg+ log P{Xi\9g)]. ( 4 ) 

i=l 3=1 

The EM algorithm maximizes the observed log-likelihood, m, according to the following process: 
(i) Starting Values: Set Z*^°) at iteration t = 0. Calculate the values of and 9^^^ based on 

z(°). 


(ii) E-Step: Evaluate = E [4(0, t, Z)|0*^‘\ , which means estimating E{zig): 


p(i+l) 

*9 


= E 


4(‘+iy 

^ig 



n' = ^ 


(5) 


(iii) M-Step: Maximize (Equation |6]) with respect to the group membership probabilities, 

r, and the distributional parameters of the observed data, 9. This produces new values 
(see Equation O and where the composition of 0^*+^) varies according to the type of 

data: 


n G 

Qit+i) ^ eg+1) [logTg + logP(x,|0,)] (6) 

i=l 3=1 
n 

^{t+1) _ i V 


(7) 
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(iv) Convergence Check: The algorithm is stopped when the log-likelihood has converged. Con¬ 
vergence is deemed to have been reached when the relative change in the log-likelihood is 
“sufficiently small”: 




< e, 


( 8 ) 


where e is a suitably small value specified by the user. For the continuous data and SBM 
applications {AIS and Karate data respectively), e was chosen as 1 x 10“® to keep the method 
consistent with alternative approaches such as hierarchical clustering initialization in mclnst 
( Fraley and Rafter^ . 1999ll . which use the same convergence criterion. For the LCA examples 
{Carcinoma and Alzheime r’s data), e was chosen to be 1 x 10“®, in keeping with previous 
experiments performed by IZhou and Lan^ ( 2010h . The EM algorithm parameter estimates 
at convergence are used as a means of clustering the data set. 

A common problem with the EM algorithm is that it can get trapped at a local maximum, 
leading to an inferior clustering solution. A computationally intensive remedy is to re-run the al¬ 
gorithm from many random Z starting positions to convergence and then select the solution with 
the highest log-likelihood. Other approaches include specifying starting values by first using some 
type of deterministic heuristic algorithm, or employing an annealing approach. Alternatively, it is 
possible to optimize starting values by combining multiple random starts together. Bayesian initial¬ 
ization averaging (BIA) provides a means of achieving this goal by using a weighted combination 
of Z starting positions. 

While the objective is generally to hnd the global log-likelihood maximum, it must be noted that 
this mode may not be optimal in terms of how meaningful the clustering is. In models where there 
are no restrictions on the covariance structure the maximum log-likelihood is actually infinite. Ad¬ 
ditionally, some spurious solutions will yield convergent log-likelihoods that are “too high”, typically 
in cases where addit i onal c omponents are fitted to capture small groups of outlying observations. 
iMcLachlan and Peell ( 200fll l document this phenomenon for mixtures of normal distributions. Eor 
the AIS, Carcinoma, Alzheimer’s and Karate data sets, spurious solutions are not an issue due to 
the number of clusters considered, which are in turn guided by previously published results under 
competing methodologies. Overall, any reference to the “global mode” should be interpreted as re¬ 
ferring to the highest hnite log-likelihood that emerges from a meaningful clustering solution in a 
stable subset of the parameter space. 

On a related note, it is worth recognising that simply ranking models by convergent likeli¬ 
hood is not on its own proof of superior clustering fit and hence a superior initialization scheme. 
However, for the motivating data sets presented, the clustering solutions corresponding to higher 
convergent likelihoods at a very minimum give valuable insight as to alternative interpretations 
of how the data can be clustered; which can be more intuitive or insightful. This point is ana¬ 
lyzed by McLachlan and Peell (2000), who argue that for continuous data the inter-component sum 
of squared distances to the cluster mean tends to improve (decrease) as convergent log-likelihood 
increases. 


3.2.1 Examples of mixture models 

The following mixture models were used to cluster the data sets described in Section [2] 
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Mix ture of Gaussians In the continuous data setting, where € K™, a mixture of Gaussians 
( Fraley and Rafterv . 1999ll can be fitted. Here 9 represents the group means and the group 
variances/covariances cr or S, depending on whether the data is univariate or multivariate 
respectively. Several possible covariance structures are available; see Fraley and Rafterv ( 1999ll 
for further details. 

Latent Class Analysis LCA ( Goodman . Il974ll involves clustering of categorical (specifically nomi¬ 
nal) data, which in the simplest case is binary in nature denoting, e.g., the presence or absence 
of a symptom. In this setting, each Xim S {0,1}, and 0 is a G x M matrix often referred to as 
the item probability parameter, that is, 0gm denotes the probability, conditional on membership 

of group 5 , that Xim = 1, for any z G 1 ,..., n, _ 

Stochastic Blockmodel Social network analysis ( Wasserman and Faustl . Il994 1 is the study of how 
links, such as friendship, between a set of actors are formed. Here, X is an n x n matrix such 
that 


Xii — 


1 if a link exists between actors and 
0 otherwise. 


The SBM ( Holland et all . Il983 : Sniiders and Nowicki . 1997 : Daudin et al . 2008ll introduces G 
latent groups underlying the network. Here, 0 is a G x G connectivity matrix that, conditional 
on their group membership, represents the probability of a link being formed between two actors. 
Note that inference in this setting is not as straightforward as for the mixture of Gaussians and 
LCA cases, and the EM algorithm cannot be used directly for this model. This is because 
the observed data likelihood func tion is not available in closed fo r m, an d an approximation is 
required; see Daudin et all ( 2008f) for further details. IVolant et al ( 2012 1 use a VB method to 
perform inference for the SBM. This approach introduces a tractable lower bound £ < ^ to the 
observed data log-likelihood which can itself be maximised in an EM-like iterative manner. To 
make use of this method requires specifying prior distributions. In the example in Section 14.51 
uninformative (proper) priors were chosen. 


The LCA and SBM approaches make a conditional independence assumption ([Hand and Yul . 


200111 ■ whereby, conditional on group membership, the data points are assumed to be independent. 
Mixture models also assume conditional independence - the covariance structure models dependence 
between variables as opposed to data points. 


3.3 Generating Starting Values 
3.3.1 Random Allocation 

A simple method for initializing the EM algorithm is to randomly allocate observations to the 
available groups. In the examples in Section |T1 unless otherwise stated, the algorithm is initialized 
according to the following method: ~ Multinomial(1,1/G,..., 1/G), for each i G I,..., n. 

Alternatively, the algorithm can be initialized by specifying random values for the parameters 0 
and T. For example, in a Bayesian setting, these could be generated from the model priors. However, 
following experimentation, superior performance was achieved by randomly generating values for 
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3.3.2 Hierarchical Clustering Starting Values 

For multivariate continuous data sets, mclust utilizes hierarchical clustering to generate a starting 
Z value for the EM algorithm. Hierarchical clustering constructs a tree-like structure known as 
a dendrogram to show groups of observations. The final clustering is built up over a number of 
steps, where similar observations are joined together. Several measures can be used to calculate the 
distance or dissimilarity between observations, with Euclidean distance the most common. There 
are also many criteria of aggregation that may be nsed such as the single linkage criterion, the 
complete linkage criterion and Ward’s method ( Ward . 1963h . The observations with the lowest 


dissimilarities are grouped together first and the process then iterates on the group level until all 
observations belong to a single group and the dendrogram mapping the process is constructed. Once 
the tree is formed, the number of groups identified as optimal depends on where the tree is cut. 
The dendrogram is conventionally cut where there is relatively wide range of distances over which 
the number of clusters in the solution does not change. An advantage of hierarchical clustering is 
that the number of clusters is not assumed in advance. However the starting valnes yielded are 
not necessarily optimal in terms of eventually achieving the global maximnm log-likelihood. BIA 
represents an alternative to this initialization method for generating better-qnality starting valnes 
to facilitate ultimately obtaining the highest possible convergent log-likelihood. 


3.3.3 Burn-in Methods and Parameter Targeting 


O’Hagan et all ( 2012 1 detail two novel EM algorithm initialization schemes for model-based cluster¬ 
ing of multivariate continuous data. The “burn-in” methods proposed start with a set of candidate Z 
matrices, run each Z for a number of EM iterations, rank the Zs according to likelihood and retain 
only a certain fraction of the candidate Zs for the next set of EM runs, until only one Z remains. The 
parameter targeting method evalnates which parameter(s) in the model are most effective in driving 
the likelihood uphill at consistent intervals and focuses M steps on those parameters, significantly 
reducing the overall computational burden. BIA represents a significantly faster and statistically 
more robust alternative to these schemes. 


3 . 3.4 Deterministic Annealing 


An alter native means of identifying a global likelihood maximnm is provided by deterministic an¬ 
nealing ( Uedal . 19981 : Zhou and Lange . 2011)1 1. Under this approach, a positive tuning parameter v 
is used to flatten the likelihood function, with the adapted likelihood rendered easier to explore 
and ultimately identify the global maximnm using the EM algorithm. If implemented correctly, the 
algorithm should then obtain the global maximum regardless of the initial starting values. 

In the case of finite mixture models, the approach is implemented by setting 


£(i+l) 

■'ig 




,(«) 


G 

E 

s' = l 




( 0 ^ 
g' > 




where is initially set to some small valne, e.g., = 0.05, then increased every s iterations 

towards v°° = 1. The positively valned r determines the rate of convergence to v°° ^ by setting 
j^(k-i-i) _ ^^(k) _j_ Note that v^,r and s must all be specified by the user. 



























12 


Adrian O’Hagan, Arthur White 


3.4 Bayesian Initialization Averaging 


In standard statistical practice, a single model is selected from a collection of competing models as 
the "best" model, using a m etric such as BIG, a nd inferences are made on the basis of this "best" 
model being the true model ( Rafterv et all 2005ll . Consequently the element of model uncertainty is 
often ignored. As a general fram ework, Bayesian mod el averaging (BMA) can provide a mechanism 
to account for this uncertainty ( Hoeting et al . [199911 . By accounting for model uncertainty, BMA 
mi nimizes prediction r i sk and has also been s how n to improve model predict ion accuracy on average 
bv lWintle et all ( 2003h . Volinskv et all ( 1997t l and Murohv and Wana ( 2001 1. among others. It does 
this by averaging over competing models to provide broader conclusions about parameter estimates 
than from using a single model. Usually BMA is used after models are fitted in order to weight over 
the convergent results of multiple models. Conversely, its use is now proposed as an initialization 
method to generate starting values for the EM algorithm in an attempt to maximize the convergent 
log-likelihood. 

The rationale for using a BMA-like approach to initialization is as follows: by averaging candidate 
starting values according to their relative quality of fit at a pre-convergence stage in running the 
EM algorithm, the overall weighted starting value is then capable of matching or exceeding the 
performance of any single individual candidate starting value, as it combines information on the 
allocation of observations to clusters from multiple evolutions of the model-fitting process. From 
a computational perspective, the approach allows a large number of candidate solutions to be 
proposed and evaluated in an efficient manner. 

The following steps are followed to apply Bayesian initialization averaging (BIA) to generate 
the EM algorithm starting values for model-based clustering. In the case of the SBM, the lower 
bound C is used as a substitute for 1 . 


(i) 

(ii) 


(iii) 


Specify a model structure or set of candidate models. 

Run a small number of preliminary E-steps and M-steps on each of a series of randomly 
generated Z starts to give an updated sequence of Z matrices, using the model structure(s) 
identified in (i). 

Calculate the value of an information criterion similar in nature to the Bayesian Information 
Criterion ( Schwarz . 1978h associated with each updated Z matrix. BIC*^. is the value of this 
criterion associated with the Z matrix, calculated as: 


BIC*z^=-2lz^+p,\og{n). (9) 

where Iz is the observed log-likelihood for the 7i matrix, pj is the number of parameters 
in the model and n is the number of observations. The key difference between BIC* 
and the standard formulation is that the criterion is not evaluated at the maximum observed 
likelihood, since this quantity will not yet have been identified. Despite not operating at the 
maximum observed likelihood, empirically it is found that the penalty term pj log(n) remains 
an effective discriminator between models of varying complexity when assigning weights to 
the competing starting positions. 

(iv) Re-scale the BIC* values by subtracting the maximum value from each of the individual 
elements. This renders the results computationally tractable for evaluation in the remaining 
steps. 
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(v) Calculate the weight for the Z matrix, Wj, for BIA using the formula: 

exp(-0.5i?/C|^) 

Evfcexp(-0.5B/C|J- ^ ^ 

(vi) Form a single new Z matrix, Z*, as the weighted average of the Zj matrices using the values 
calculated by equation [10] as the weights: 


vt 

(vii) Run the EM algorithm to convergence using the single BIA weighted starting Z* matrix 
formed in (vi). 


3.5 The Label Switching Problem 


A label switching problem is encountered when applying the BIA method, prior to implementing 
step (vi) above. The maximum likelihood label configurations from the runs of EM algorithm are 
invariant to permutations of the data labels. Hence, if the BIA weighting process is applied to a 
large number of candidate Z matrices that have undergone a preliminary set of EM steps, the result 
is merely a single Z matrix with roughly equal membership probabilities for all observations across 
all available groups. 

The method employed to undo the label switching problem is a “soft” label switching method. 
This method involves u sing the actual Z va lues or "soft" group membership probabilities associated 
with each observation ( Slonim et al [2005 1. Hence observations on the boundaries between several 
clusters are not forced to fully belong to one of the clusters, but rather are assigned membership 
percentages indicating their partial membership ( Rokach and Maimon . 2010l) . The matchClasses 
function in R was used to switch the labels. The matchClasses function takes the actual values in 
the array of Z matrices and matches them with a fixed row in the Z matrix to fix the labels. Each 
row of the matrix is sequentially matched versus the fixed row until the overall labelling structure 
is repaired. Using the actual Z probabilities avoids the loss of information associated with label¬ 


switchi ng methodologies that opera te on the “hard” group labe l s, suc h as that associated with the 
work of iCarpaneto and Toth! ( 1980h and iNobile and Fearnsid j ( 2007 ). However, for completeness. 


the latter “hard” label-switching methodology was checked for consistency in the continuous data 
case and deviations in outcomes versus the “soft” approach were found to be minimal. The method 
is also robust in the presence of empty clusters, though this issue is not encountered for any of the 
motivating data sets considered. While it is true that no method for correcting for label switch¬ 
ing is guaranteed to provide an optimal solution under all circumstances, the results presented 
demonstrate that the “soft” approach performs well for the real and simulated data considered. 


4 Results 

The results in this section were obtained using a varying number of initial Z matrices and prelim¬ 
inary EM runs for each data set and both label switching methods. The BIA code was hrst run 
for a range of different values of the number of initial Z matrices and the number of preliminary 
EM runs and the optimum combination was then used to produce the results below. For some 
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data sets, this number of initial Z matrices and preliminary EM steps may seem computationally 
costly. However, superior BIA performance can still be witnessed over competing methods at a 
much lower computational cost by reducing the number of Zs and preliminary EM steps, but not 
as frequently. Under the method of random starts, all randomly generated Z matrices are run until 
full convergence of the EM algorithm. The early iterations of any EM algorithm are usually the 
most profitable in terms of driving the likelihood function uphill. The BIA method exploits this 
since it focuses computational power on these early iterations prior to weighting the processed Z 
matrices to form a single Z for use in the EM algorithm. 


4.1 Australian Institute of Sports (AIS) data 


Figure [ 5 ] displays a histogram of the convergent log-likelihoods for the 11 continuous variables of 
the AIS data set, using BIA, with both 40 initial Z matrices and 50 preliminary EM runs (Figure 
5(6)) and 50 initial Z matrices and 100 preliminary EM runs (Figure [5(c)[) respectively. The former 


gives a similar distribution of convergent log-likelihood v alues to that obtained using the previously 
developed pyramid burn-in methodology developed by ( O’Hagan et al [ 2 OI 2 II in Figure 5(a) The 
latter gives a superior distribution of convergent log-likelihoods. The model applied has G = 2 
multivariate normal components and ellipsoidal covariance structure with equal shape and volume 
(EEV in standard mclust nomenclature) as selected by BIG. The purple vertical line represents the 
optimal log-likelihood reached by mclust ’s hierarchical clustering initialization with G = 2 groups 
and EEV covariance structure. The “soft” label-switching methodology is employed in this instance, 
but there is minimal effect on the results of alternatively using the “hard” label-switching approach. 
The BIA approach is extremely f ast to run, requ i ring o nly 5% of the processing time needed by the 
alternative schemes developed bv IO ’Hagan et aJ ( 2012ll for the AIS data. This shows that BIA has 
potential to be applied to data sets with large numbers of observations and variables, or to more 
complex mixture models for continuous data. For thoroughness, the BIA method w as also applied 
to the Virginicas, Galaxies and Hidalgo data sets used as motivating examples by lO ’Hagan et al 
( 2 OI 2 II . As was the case with the AIS data, the distribution of convergent log-likelihoods was 
comparable to or better than that found using the competing burn-in and parameter targeting 
methods, but at considerably lower computational cost. 


Figure [6] displays the pairs plot for body mass index (bmi) and body fat Percentage (Bfat) under 
the clustering solution obtained using a hierarchical initialization of the EM algorithm (convergent 
log-likelihood of —4,743.6). Compared to the optimal convergent log-likelihood that is regularly 
identified by the BIA method (—4,722.4), 15 data points change cluster membership, highlighted 
as crosses. The change in clustering solution produces a shift in the total within-cluster sum of 
squared distances to the cluster mean from 619,110 for the convergent log-likelihood identified by 
hierarchical clustering initialization to 606,377 for the global mode identified by BIA. This example 
shows that converging to a higher log-likelihood can change the optimal clustering solution of a 
data set and provide a more intuitive result with clearer separation between groups. Increasingly 
higher log-likelihoods at convergence may be associated with increasingly more marked divergences 
in clustering solutions, which at a very minimum should be evaluated for the insights they offer. 
This demonstrates the importance of converging to the highest possible meaningful log-likelihood 
and the inherent value of the BIA method. 
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Fig. 5 AIS data: histogram of convergent log-likelihoods with G = 2 a nd ellipsoidal covarian ce structure with 
equal shape and volume (EEV) using (a) pyramid burn-in as detailed in llO ’Hagan et all l2012tl (b) BIA with 40 
initial Z matrices and 50 preliminary EM runs (c) BIA with 50 initial Z matrices and 100 preliminary EM runs. 
The purple vertical line at —4743.6 represents the optimal log-likelihood reached by mclust’s hierarchical clustering 
initialization. 


4.2 Simulated mixture of Gaussians data 


When a multivariate mixture of Gaussians model w ith the most general covariance structure is fitted 
to the simulated data as per iBaudrv et all (120151) , hierarchical clustering initialization identihes a 
3 group solution as optimal in terms of BIG for 98 of the 100 data sets tested. The same mixture 
of Gaussian distributions initialized using BIA identifies a 3 group solutions as optimal in terms 
of BIG in 49 of the 100 cases and a 2 group solution as optimal in the remaining 51 cases. This is 
preferable in such a case where both the 2 group and 3 group solutions are viable, demonstrating 
to the user that the underlying horizontal and vertical data structure can be recreated with one 
less component if so desired. Hence BIA can be seen to offer similar b enefits to the Super vised 
Integrated Gomplete Likelihood (SIGL) decision criterion introduced in iBaudrv et all (I201511 . but 
rather than change the model decision criterion, BIG is retained and instead a different set of 
decisions is reached via superior initialization of the model. 


4.3 Garcinoma data 

For each of 100 randomly generated starting values, the EM algorithm was run to convergence for 
the Carcinoma data and multiple local maxima were uncovered. The distribution of convergent 
lower-bounds to the log-likelihood is shown in Figure [T] Parameter estimates obtained by the LCA 
model are visualised in Figures [H These represent the estimated item probabilities 6 as heat maps, 
with values close to 1 represented by a lighter grey colour, and values close to 0 with darker grey. 
The estimated class size f is represented by the relative length of the cells in each row. 

As can be seen in FigurelSl the parameter estimates corresponding to convergent log-likelihood 
—289.79 and convergent log-likelihood —289.29 are highly similar, however, a different interpretation 
of the data corresponds to convergent log-likelihood —291.27. This illustrates the importance of not 
converging to the sub-optimal mode. 
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Fig. 6 AIS data: Pairs plot for body mass index (bmi) and body fat Percentage (Bfat) showing the clustering 
solution using mclust’s hierarchical clustering initialization. The 15 observations that change cluster membership 
when using the optimal likelihood identified by BIA are overlaid with crosses. 


As seen in Tabled BIA achieves the global maximum convergent likelihood or close to it almost 
90% of the time, using 30 random starts and 10 iterations of the algorithm before BIA is applied. 
This is a marked improvement over the outcome when random starts are employed. Computationally 
the approach also performed better, with a single run of the BIA algorithm requiring about one 
third of the time of running to convergence 100 random starting positions. 


Table 2 Distribution of convergent log-likelihoods for 100 initializations of the Carcinoma data using BIA and 
random starts. 


Log-likelihood 

-294 

-293 

-292 

-291 

-290 

-289 

BIA 

0 

0 

1 

10 

46 

43 

Random Starts 

1 

3 

21 

12 

40 

23 
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Fig. 7 Histogram of distribution of convergent log-likelihoods using 100 random starts for the Carcinoma data. 
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Fig. 8 Visualisation of different parameter estimates obtained by local maxima for the LCA model applied to the 
Carcinoma data. 
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Fig. 9 Histogram of distribution of convergent log-likelihoods using 100 random starts for the Alzheimer’s data. 


4.4 Alzheimer’s data 

A 3 group LCA model was applied to the Alzheimer’s data from 100 randomly generated starting 
values. The distribution of the convergent log-likelihoods is shown in Figure [51 the global maximum 
is only obtained in 18 cases. Table [3] displays the differing solutions arising from applying LCA to 
the Alzheimer’s data at convergent log-likelihoods of —745.7 and —743.5 respectively. Again, the 
latter solution produces a significant alteration in interpretation of symptom composition in the 
three different classes, with ramifications for subsequent patient classification. 


Table 3 Parameter estimates for the 3 group LCA model fitted to the Alzheimer’s data. Firstly, the sub-optimal 
convergent log-likelihood —745.7; and secondly, the global convergent log-likelihood —743.5. 



rg 

Hallucination 

Activity 

Aggression 

Agitation 

Diurnal 

Affective 

1 

0.39 

0.13 

0.70 

0.28 

0.00 

0.30 

0.78 

2 

0.31 

0.03 

0.46 

0.00 

0.18 

0.04 

0.51 

3 

0.30 

0.07 

0.80 

0.41 

1.00 

0.38 

0.97 

1 

0.51 

0.06 

0.52 

0.06 

0.13 

0.10 

0.55 

2 

0.47 

0.10 

0.79 

0.37 

0.60 

0.37 

1.00 

3 

0.02 

0.00 

0.82 

1.00 

0.21 

1.00 

0.00 


The performances of the BIA and annealing methods are compared to that of random starts in 
TablejH Each method was run 100 times. In the case of BIA, 20 random starts run for 200 iterations 
were averaged. In this instance, the approach was computationally slower, a single run of the BIA 
algorithm run taking about 80% longer than separately running a single run to convergence 100 
times. 
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For the annealing method, after some experimentation, the tuning parameters were set to be 
= 0.12, r = 0.87 and s = 10. While the BIA approach produces an improved performance, 
with the global maximum being achieved in 67% of cases, the annealing method converges to the 
global maximum only once. Despite an extensive search of the tuning parameter space, it proved 
impossible to substantively improve this performance. 


Table 4 Distribution of convergent log-likelihoods, rounded to the nearest integer, for 100 initialisations of the 
Alzheimer’s data using random starts, annealing and BIA. 


Log-likelihood 

-748 

-745 

-745 

-744 

-743 

Random Starts 

3 

51 

27 

1 

18 

Annealing 

0 

97 

2 

0 

1 

BIA 

0 

13 

20 

0 

67 


This is demonstrated in Figure [TOl which shows heat maps of the convergent log-likelihoods 
obtained, firstly using deterministic annealing for different settings of and r and secondly using 
BIA for different numbers of random starts and EM algorithm iterations per random start. Despite 
using 400 different parameter combinations, the global mode is rarely discovered using deterministic 
annealing^ Conversely, once the BIA method is allowed to initially run for more than 110 iterations, 
the global mode is found consistently, suggesting that the clustering performance of BIA is not as 
sensitive to the choice of initialization parameters. 

A possible explanation for the poor performance of the annealing method is that in the case of 
the Alzheimer’s data the suboptimal solution consists of roughly equal sized groups; the opposite 
is true in the Carcinoma example, where the method’s performance is superior to that of BIA. It 
appears that, unless its tuning parameters are very precisely specified, the annealing method may 
be biased towards solutions where cluster sizes do not vary significantly. The propensity for the 
competing methodologies to attain the global maximum convergent log-likelihood is tested across 
100 matching random starting positions, with the results presented in Tabled) The BIA approach 
outperforms the basic random starts and annealing methods. 


4.5 Karate data 

Obtaining the global maximum for SBM applied to the Karate data set when G = 4 using random 
starts proves to be challenging; over the course of 200 different runs, the maximum was reached 
only 4 times. A histogram of the various local maxima that can be obtained is shown in Figure 
[TT] Conversely, the BIA approach, using 200 starting values run initially for 15 iterations, obtains 
the global maximum 19 times out of 20. These results are shown in Tabled) The BIA approach is 
also much more computationally efficient in this instance; a complete pass of the BIA algorithm is 
approximately 30 times faster than the process of having a single starting position run to convergence 
200 times. 

The clustering that obtains the global maximum is visualised in Figure [121 Made clear is the 
division of the group into two political factions, with each faction having leaders (the nodes labelled 
2 and 3, coloured medium-dark and medium-light grey ) and followers (the nodes coloured light 

^ Using r = 0.94 or 0.95 regularly resulted in a saddle point with the value of the log-likelihood converging to -770. 
These results have been omitted from Figure [TOl 
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Fig. 10 Heat map of convergent log-likelihoods obtained using (left) deterministic annealing with different tuning 
parameter settings and (right) BIA with different tuning parameter settings, for the Alzheimer’s data. A light grey 
colour indicates an optimal value of the convergent log-likelihood has been obtained. 


Table 5 Distribution of convergent lower bounds to the log-likelihood C for the Karate data. For the random starts 
approach, 200 initialisations were used, while the BIA approach was implemented 20 times. 


c 

-204 

-197 

-195 

-194 

-193 

-188 

-187 

-184 

-180 

Random Starts 

1 

34 

45 

10 

83 

19 

3 

1 

4 

BIA 

0 

0 

0 

0 

0 

0 

1 

0 

19 


and and dark grey, labelled 1 and 4), which mainly connect to their respective leaders. Note that 
despite being the two largest groups, no links are shared between pairs of nodes labelled 1 and 4. 
This clustering complements the description of the data in Section E751 


5 Conclusions 

We have demonstrated that using BIA across a variety of model-based clustering applications can 
lead to higher convergent likelihoods being reached and that optimal likelihoods can be reached 
more frequently than under other competing methods. The importance for the clustering solution 
of reaching the optimal likelihood was also clearly demonstrated. A sub-optimal convergent log- 
likelihood can lead to an inferior, potentially misleading clustering solution. 

The BIA method improves upon mclust’s hierarchical clustering initialization method for the 
multivariate continuous AIS data and reaches a higher convergent likelihood. This higher likeli¬ 
hood in turn changes the clustering solution of the data. While for this example it is not a major 
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Fig. 11 Histogram of local maxima obtained for the stochastic blockmodel (SBM) applied to the Karate data from 
200 random starting values. 


change, it does highlight that the clustering solutions obtained by mclust are not guaranteed to 
be optimal. For the simulation experiment using mixtures of overlapping Gaussian components, 
BIA correctly identifies both relevant solutions whereas hierarchical clustering initialization almost 
always identifies only the 3 group possibility. For th e categorical data examples, the BIA results are 
reasonably close to those of IZhou and Lanra ( 201fll l with respect to the Carcinoma data, while the 
BIA method comprehensively outperforms the annealing approach for the Alzheimer’s data. For 
the network data example using the Karate data, the BIA approach attains the global maximum 
likelihood for the 4-group model far more consistently and efficiently than a simple random starts 
approach. 

While the illustrative data sets used were moderate in terms of size and dimensionality, the BIA 
approach could prove increasingly vital as these quantities increase. This is especially true for net¬ 
work data, where, in its simplest form, the size of a data set increases quadratically with the sample 
size. Alternatively, the approach may also prove useful for more sophisticated models with addi¬ 
tional parameters to be estimated. For exam ple, in the case of cont i nuous data, a mixture of skewed , 
heavy-tailed distributions s uch a s the skew-t ( McLachlan and Peell . 19981 : Andrews and McNicholas , 


201 It iLee and McLachlanl l2012h may be preferred to a mixture of Gaussians. 


An increased number of EM runs and start ing positions will genera lly boost performance of the 
BIA approach, but as in the case of tempering ( Zhou and Lang j . 201Clll there is no single systematic 
rule for selecting an optimum combination, and some trial and error is required. However, extensive 
experimentation on the motivating data sets has shown that the results presented are robust for 
wide ranges of number of EM runs and number of starting positions. Intuitively it seems reasonable 
to expect that, if only one of these two inputs is to be elevated, one should favour many starting 
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Fig. 12 Plot of the Karate network data set, with 4 groups fitted to the data. 


positions with a relatively small number of runs for each, since the EM algorithm usually increases 
likelihood most substantially in the early iterations. For the data sets analyzed, this intuition holds 
for the Karate, AIS and Carcinoma data, but not the Alzheimer’s data. 

In the examples shown, different starting values were generated using standard approaches. This 
was done to ensure simple and fair settings were used to compare BIA with competing initialization 
methods such as random starts, hierarchical clustering, and deterministic annealing. However, this 
does not have to be the case. For example, in the case of continuous data, clustering solutions 
obtained from the different covariance structures available in mclust could be used. An advantage 
of the BIA approach is that rather than discarding competing heuristic approaches, they may 
be assimilated in a principled, statistically robust manner that efficiently harnesses the available 
computational resources. 
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